{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Speech Command Classification with torchaudio\n\nThis tutorial will show you how to correctly format an audio dataset and\nthen train/test an audio classifier network on the dataset.\n\nColab has GPU option available. In the menu tabs, select \u201cRuntime\u201d then\n\u201cChange runtime type\u201d. In the pop-up that follows, you can choose GPU.\nAfter the change, your runtime should automatically restart (which means\ninformation from executed cells disappear).\n\nFirst, let\u2019s import the common torch packages such as\n[torchaudio](https://github.com/pytorch/audio)_ that can be installed\nby following the instructions on the website.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Uncomment the line corresponding to your \"runtime type\" to run in Google Colab\n\n# CPU:\n# !pip install pydub torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html\n\n# GPU:\n# !pip install pydub torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nimport torchaudio\nimport sys\n\nimport matplotlib.pyplot as plt\nimport IPython.display as ipd\n\nfrom tqdm import tqdm"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\u2019s check if a CUDA GPU is available and select our device. Running\nthe network on a GPU will greatly decrease the training/testing runtime.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nprint(device)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Importing the Dataset\n\nWe use torchaudio to download and represent the dataset. Here we use\n[SpeechCommands](https://arxiv.org/abs/1804.03209)_, which is a\ndatasets of 35 commands spoken by different people. The dataset\n``SPEECHCOMMANDS`` is a ``torch.utils.data.Dataset`` version of the\ndataset. In this dataset, all audio files are about 1 second long (and\nso about 16000 time frames long).\n\nThe actual loading and formatting steps happen when a data point is\nbeing accessed, and torchaudio takes care of converting the audio files\nto tensors. If one wants to load an audio file directly instead,\n``torchaudio.load()`` can be used. It returns a tuple containing the\nnewly created tensor along with the sampling frequency of the audio file\n(16kHz for SpeechCommands).\n\nGoing back to the dataset, here we create a subclass that splits it into\nstandard training, validation, testing subsets.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from torchaudio.datasets import SPEECHCOMMANDS\nimport os\n\n\nclass SubsetSC(SPEECHCOMMANDS):\n    def __init__(self, subset: str = None):\n        super().__init__(\"./\", download=True)\n\n        def load_list(filename):\n            filepath = os.path.join(self._path, filename)\n            with open(filepath) as fileobj:\n                return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]\n\n        if subset == \"validation\":\n            self._walker = load_list(\"validation_list.txt\")\n        elif subset == \"testing\":\n            self._walker = load_list(\"testing_list.txt\")\n        elif subset == \"training\":\n            excludes = load_list(\"validation_list.txt\") + load_list(\"testing_list.txt\")\n            excludes = set(excludes)\n            self._walker = [w for w in self._walker if w not in excludes]\n\n\n# Create training and testing split of the data. We do not use validation in this tutorial.\ntrain_set = SubsetSC(\"training\")\ntest_set = SubsetSC(\"testing\")\n\nwaveform, sample_rate, label, speaker_id, utterance_number = train_set[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A data point in the SPEECHCOMMANDS dataset is a tuple made of a waveform\n(the audio signal), the sample rate, the utterance (label), the ID of\nthe speaker, the number of the utterance.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Shape of waveform: {}\".format(waveform.size()))\nprint(\"Sample rate of waveform: {}\".format(sample_rate))\n\nplt.plot(waveform.t().numpy());"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\u2019s find the list of labels available in the dataset.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "labels = sorted(list(set(datapoint[2] for datapoint in train_set)))\nlabels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The 35 audio labels are commands that are said by users. The first few\nfiles are people saying \u201cmarvin\u201d.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "waveform_first, *_ = train_set[0]\nipd.Audio(waveform_first.numpy(), rate=sample_rate)\n\nwaveform_second, *_ = train_set[1]\nipd.Audio(waveform_second.numpy(), rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The last file is someone saying \u201cvisual\u201d.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "waveform_last, *_ = train_set[-1]\nipd.Audio(waveform_last.numpy(), rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Formatting the Data\n\nThis is a good place to apply transformations to the data. For the\nwaveform, we downsample the audio for faster processing without losing\ntoo much of the classification power.\n\nWe don\u2019t need to apply other transformations here. It is common for some\ndatasets though to have to reduce the number of channels (say from\nstereo to mono) by either taking the mean along the channel dimension,\nor simply keeping only one of the channels. Since SpeechCommands uses a\nsingle channel for audio, this is not needed here.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "new_sample_rate = 8000\ntransform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=new_sample_rate)\ntransformed = transform(waveform)\n\nipd.Audio(transformed.numpy(), rate=new_sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We are encoding each word using its index in the list of labels.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def label_to_index(word):\n    # Return the position of the word in labels\n    return torch.tensor(labels.index(word))\n\n\ndef index_to_label(index):\n    # Return the word corresponding to the index in labels\n    # This is the inverse of label_to_index\n    return labels[index]\n\n\nword_start = \"yes\"\nindex = label_to_index(word_start)\nword_recovered = index_to_label(index)\n\nprint(word_start, \"-->\", index, \"-->\", word_recovered)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To turn a list of data point made of audio recordings and utterances\ninto two batched tensors for the model, we implement a collate function\nwhich is used by the PyTorch DataLoader that allows us to iterate over a\ndataset by batches. Please see [the\ndocumentation](https://pytorch.org/docs/stable/data.html#working-with-collate-fn)_\nfor more information about working with a collate function.\n\nIn the collate function, we also apply the resampling, and the text\nencoding.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def pad_sequence(batch):\n    # Make all tensor in a batch the same length by padding with zeros\n    batch = [item.t() for item in batch]\n    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)\n    return batch.permute(0, 2, 1)\n\n\ndef collate_fn(batch):\n\n    # A data tuple has the form:\n    # waveform, sample_rate, label, speaker_id, utterance_number\n\n    tensors, targets = [], []\n\n    # Gather in lists, and encode labels as indices\n    for waveform, _, label, *_ in batch:\n        tensors += [waveform]\n        targets += [label_to_index(label)]\n\n    # Group the list of tensors into a batched tensor\n    tensors = pad_sequence(tensors)\n    targets = torch.stack(targets)\n\n    return tensors, targets\n\n\nbatch_size = 256\n\nif device == \"cuda\":\n    num_workers = 1\n    pin_memory = True\nelse:\n    num_workers = 0\n    pin_memory = False\n\ntrain_loader = torch.utils.data.DataLoader(\n    train_set,\n    batch_size=batch_size,\n    shuffle=True,\n    collate_fn=collate_fn,\n    num_workers=num_workers,\n    pin_memory=pin_memory,\n)\ntest_loader = torch.utils.data.DataLoader(\n    test_set,\n    batch_size=batch_size,\n    shuffle=False,\n    drop_last=False,\n    collate_fn=collate_fn,\n    num_workers=num_workers,\n    pin_memory=pin_memory,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Define the Network\n\nFor this tutorial we will use a convolutional neural network to process\nthe raw audio data. Usually more advanced transforms are applied to the\naudio data, however CNNs can be used to accurately process the raw data.\nThe specific architecture is modeled after the M5 network architecture\ndescribed in [this paper](https://arxiv.org/pdf/1610.00087.pdf)_. An\nimportant aspect of models processing raw audio data is the receptive\nfield of their first layer\u2019s filters. Our model\u2019s first filter is length\n80 so when processing audio sampled at 8kHz the receptive field is\naround 10ms (and at 4kHz, around 20 ms). This size is similar to speech\nprocessing applications that often use receptive fields ranging from\n20ms to 40ms.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "class M5(nn.Module):\n    def __init__(self, n_input=1, n_output=35, stride=16, n_channel=32):\n        super().__init__()\n        self.conv1 = nn.Conv1d(n_input, n_channel, kernel_size=80, stride=stride)\n        self.bn1 = nn.BatchNorm1d(n_channel)\n        self.pool1 = nn.MaxPool1d(4)\n        self.conv2 = nn.Conv1d(n_channel, n_channel, kernel_size=3)\n        self.bn2 = nn.BatchNorm1d(n_channel)\n        self.pool2 = nn.MaxPool1d(4)\n        self.conv3 = nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3)\n        self.bn3 = nn.BatchNorm1d(2 * n_channel)\n        self.pool3 = nn.MaxPool1d(4)\n        self.conv4 = nn.Conv1d(2 * n_channel, 2 * n_channel, kernel_size=3)\n        self.bn4 = nn.BatchNorm1d(2 * n_channel)\n        self.pool4 = nn.MaxPool1d(4)\n        self.fc1 = nn.Linear(2 * n_channel, n_output)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = F.relu(self.bn1(x))\n        x = self.pool1(x)\n        x = self.conv2(x)\n        x = F.relu(self.bn2(x))\n        x = self.pool2(x)\n        x = self.conv3(x)\n        x = F.relu(self.bn3(x))\n        x = self.pool3(x)\n        x = self.conv4(x)\n        x = F.relu(self.bn4(x))\n        x = self.pool4(x)\n        x = F.avg_pool1d(x, x.shape[-1])\n        x = x.permute(0, 2, 1)\n        x = self.fc1(x)\n        return F.log_softmax(x, dim=2)\n\n\nmodel = M5(n_input=transformed.shape[0], n_output=len(labels))\nmodel.to(device)\nprint(model)\n\n\ndef count_parameters(model):\n    return sum(p.numel() for p in model.parameters() if p.requires_grad)\n\n\nn = count_parameters(model)\nprint(\"Number of parameters: %s\" % n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will use the same optimization technique used in the paper, an Adam\noptimizer with weight decay set to 0.0001. At first, we will train with\na learning rate of 0.01, but we will use a ``scheduler`` to decrease it\nto 0.001 during training after 20 epochs.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0001)\nscheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # reduce the learning after 20 epochs by a factor of 10"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Training and Testing the Network\n\nNow let\u2019s define a training function that will feed our training data\ninto the model and perform the backward pass and optimization steps. For\ntraining, the loss we will use is the negative log-likelihood. The\nnetwork will then be tested after each epoch to see how the accuracy\nvaries during the training.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def train(model, epoch, log_interval):\n    model.train()\n    for batch_idx, (data, target) in enumerate(train_loader):\n\n        data = data.to(device)\n        target = target.to(device)\n\n        # apply transform and model on whole batch directly on device\n        data = transform(data)\n        output = model(data)\n\n        # negative log-likelihood for a tensor of size (batch x 1 x n_output)\n        loss = F.nll_loss(output.squeeze(), target)\n\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n\n        # print training stats\n        if batch_idx % log_interval == 0:\n            print(f\"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\\tLoss: {loss.item():.6f}\")\n\n        # update progress bar\n        pbar.update(pbar_update)\n        # record loss\n        losses.append(loss.item())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that we have a training function, we need to make one for testing\nthe networks accuracy. We will set the model to ``eval()`` mode and then\nrun inference on the test dataset. Calling ``eval()`` sets the training\nvariable in all modules in the network to false. Certain layers like\nbatch normalization and dropout layers behave differently during\ntraining so this step is crucial for getting correct results.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def number_of_correct(pred, target):\n    # count number of correct predictions\n    return pred.squeeze().eq(target).sum().item()\n\n\ndef get_likely_index(tensor):\n    # find most likely label index for each element in the batch\n    return tensor.argmax(dim=-1)\n\n\ndef test(model, epoch):\n    model.eval()\n    correct = 0\n    for data, target in test_loader:\n\n        data = data.to(device)\n        target = target.to(device)\n\n        # apply transform and model on whole batch directly on device\n        data = transform(data)\n        output = model(data)\n\n        pred = get_likely_index(output)\n        correct += number_of_correct(pred, target)\n\n        # update progress bar\n        pbar.update(pbar_update)\n\n    print(f\"\\nTest Epoch: {epoch}\\tAccuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\\n\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we can train and test the network. We will train the network\nfor ten epochs then reduce the learn rate and train for ten more epochs.\nThe network will be tested after each epoch to see how the accuracy\nvaries during the training.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "log_interval = 20\nn_epoch = 2\n\npbar_update = 1 / (len(train_loader) + len(test_loader))\nlosses = []\n\n# The transform needs to live on the same device as the model and the data.\ntransform = transform.to(device)\nwith tqdm(total=n_epoch) as pbar:\n    for epoch in range(1, n_epoch + 1):\n        train(model, epoch, log_interval)\n        test(model, epoch)\n        scheduler.step()\n\n# Let's plot the training loss versus the number of iteration.\n# plt.plot(losses);\n# plt.title(\"training loss\");"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The network should be more than 65% accurate on the test set after 2\nepochs, and 85% after 21 epochs. Let\u2019s look at the last words in the\ntrain set, and see how the model did on it.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def predict(tensor):\n    # Use the model to predict the label of the waveform\n    tensor = tensor.to(device)\n    tensor = transform(tensor)\n    tensor = model(tensor.unsqueeze(0))\n    tensor = get_likely_index(tensor)\n    tensor = index_to_label(tensor.squeeze())\n    return tensor\n\n\nwaveform, sample_rate, utterance, *_ = train_set[-1]\nipd.Audio(waveform.numpy(), rate=sample_rate)\n\nprint(f\"Expected: {utterance}. Predicted: {predict(waveform)}.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let\u2019s find an example that isn\u2019t classified correctly, if there is one.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for i, (waveform, sample_rate, utterance, *_) in enumerate(test_set):\n    output = predict(waveform)\n    if output != utterance:\n        ipd.Audio(waveform.numpy(), rate=sample_rate)\n        print(f\"Data point #{i}. Expected: {utterance}. Predicted: {output}.\")\n        break\nelse:\n    print(\"All examples in this dataset were correctly classified!\")\n    print(\"In this case, let's just look at the last data point\")\n    ipd.Audio(waveform.numpy(), rate=sample_rate)\n    print(f\"Data point #{i}. Expected: {utterance}. Predicted: {output}.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Feel free to try with one of your own recordings of one of the labels!\nFor example, using Colab, say \u201cGo\u201d while executing the cell below. This\nwill record one second of audio and try to classify it.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def record(seconds=1):\n\n    from google.colab import output as colab_output\n    from base64 import b64decode\n    from io import BytesIO\n    from pydub import AudioSegment\n\n    RECORD = (\n        b\"const sleep  = time => new Promise(resolve => setTimeout(resolve, time))\\n\"\n        b\"const b2text = blob => new Promise(resolve => {\\n\"\n        b\"  const reader = new FileReader()\\n\"\n        b\"  reader.onloadend = e => resolve(e.srcElement.result)\\n\"\n        b\"  reader.readAsDataURL(blob)\\n\"\n        b\"})\\n\"\n        b\"var record = time => new Promise(async resolve => {\\n\"\n        b\"  stream = await navigator.mediaDevices.getUserMedia({ audio: true })\\n\"\n        b\"  recorder = new MediaRecorder(stream)\\n\"\n        b\"  chunks = []\\n\"\n        b\"  recorder.ondataavailable = e => chunks.push(e.data)\\n\"\n        b\"  recorder.start()\\n\"\n        b\"  await sleep(time)\\n\"\n        b\"  recorder.onstop = async ()=>{\\n\"\n        b\"    blob = new Blob(chunks)\\n\"\n        b\"    text = await b2text(blob)\\n\"\n        b\"    resolve(text)\\n\"\n        b\"  }\\n\"\n        b\"  recorder.stop()\\n\"\n        b\"})\"\n    )\n    RECORD = RECORD.decode(\"ascii\")\n\n    print(f\"Recording started for {seconds} seconds.\")\n    display(ipd.Javascript(RECORD))\n    s = colab_output.eval_js(\"record(%d)\" % (seconds * 1000))\n    print(\"Recording ended.\")\n    b = b64decode(s.split(\",\")[1])\n\n    fileformat = \"wav\"\n    filename = f\"_audio.{fileformat}\"\n    AudioSegment.from_file(BytesIO(b)).export(filename, format=fileformat)\n    return torchaudio.load(filename)\n\n\n# Detect whether notebook runs in google colab\nif \"google.colab\" in sys.modules:\n    waveform, sample_rate = record()\n    print(f\"Predicted: {predict(waveform)}.\")\n    ipd.Audio(waveform.numpy(), rate=sample_rate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nIn this tutorial, we used torchaudio to load a dataset and resample the\nsignal. We have then defined a neural network that we trained to\nrecognize a given command. There are also other data preprocessing\nmethods, such as finding the mel frequency cepstral coefficients (MFCC),\nthat can reduce the size of the dataset. This transform is also\navailable in torchaudio as ``torchaudio.transforms.MFCC``.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}